Decision Trees and Ensemble Methods

DDD: Elements of Statistical Machine Learning & Politics of Data

Ayush Patel

At Azim Premji University, Bhopal

14 Feb, 2026

Hello

I am Ayush.

I am a researcher working at the intersection of data, development and economics.

I am an RStudio (Posit) Certified Tidyverse Instructor.

I am a Researcher at the Oxford Poverty and Human Development Initiative (OPHI), University of Oxford.

Did you come prepared?

  • You have installed R. If not, see this link.

  • You have installed RStudio/Positron/VSCode or any other IDE. It is recommended that you work through an IDE.

  • You have the libraries {tree}, {gbm}, {randomForest}, and {BART} installed.

Learning Goals

  1. What are decision trees?
  2. When to use them?
  3. How do these work?
  4. Application and interpretation.
  5. Improving decision trees using ensemble methods.

Decision Trees


Supervised and non-parametric

Can be used for Classification and Regression

The predictor space is cut into segments, and the mean response of the training observations in a segment is used as the estimate of the response for test observations falling in that segment.

Simple and easy to interpret, but not the best for prediction accuracy on its own.

Prediction accuracy can be improved by ensemble methods

Intuition - How it works

Can you identify regions with similar salary ranges?

Model output - Predictor Space

from ISLR

Model output - Tree

from ISLR

Region representation



\(R_1 = \left\{X \mid \text{Years} < 4.5\right\}\)

\(R_2 = \left\{X \mid \text{Years} \geq 4.5,\ \text{Hits} < 117.5\right\}\)

\(R_3 = \left\{X \mid \text{Years} \geq 4.5,\ \text{Hits} \geq 117.5\right\}\)

Tree terminology

from ISLR

How should we segment the predictor space?

Segmenting - Theory

The predictor space of \(p\) variables needs to be segmented into \(J\) distinct regions.

In theory, the regions can be of any shape; in practice, high-dimensional rectangles (boxes) are chosen for computational ease and interpretability.

For every observation in region \(R_j\), we make the same prediction: the mean (regression) or the mode (classification) of the training observations in \(R_j\).

Minimize: \(\sum_{j=1}^{J}{\sum_{i \in R_j}{(y_i - \hat{y}_{R_j})^2}}\)
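The prediction rule and the RSS objective above can be sketched in a few lines of base R. The data here are toy values, not any of the deck's examples:

```r
# Given a region assignment, the tree's prediction for every observation
# in region R_j is the mean response of the training observations in R_j.
y      <- c(5, 7, 6, 20, 22, 21)   # hypothetical responses
region <- c(1, 1, 1, 2, 2, 2)      # hypothetical region labels (R_1, R_2)

y_hat <- ave(y, region)            # within-region means: 6, 6, 6, 21, 21, 21
rss   <- sum((y - y_hat)^2)        # the quantity minimized over partitions: 4
```

Minimizing this RSS over all possible partitions of the predictor space is what the segmentation step aims at.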

But

It is computationally infeasible to consider every possible partition of the predictor space: all possible cutpoints, for all possible predictors, in all possible sequences.

We use high-dimensional rectangles instead of any shape for ease of interpretation.

So, use Recursive Binary splitting

top-down

greedy

top-down



We begin at the top, where all observations belong to a single region, and successively split the predictor space downwards. Hence the name top-down.

Greedy



The best split is chosen at that particular step; we do not look ahead to future splits. A predictor \(X_j\) and a cutpoint \(s\) are chosen such that the resulting split yields the lowest RSS.

This is carried out recursively, over and over again.

Formally

We aim to minimize the following over \(j\) and \(s\) at every step

\(\sum_{i:x_i \in R_1 (j,s)}{(y_i - \hat{y}_{R_1})^2} + \sum_{i:x_i \in R_2 (j,s)}{(y_i - \hat{y}_{R_2})^2}\)
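One greedy step can be sketched in base R for a single predictor; the full algorithm repeats this scan over all \(p\) predictors and then recurses into each child region. The `best_split()` helper and the data below are hypothetical:

```r
# For one predictor x, scan candidate cutpoints s and pick the one
# minimizing the two-region RSS from the formula above.
best_split <- function(x, y) {
  cuts <- sort(unique(x))
  cuts <- (head(cuts, -1) + tail(cuts, -1)) / 2   # midpoints as candidates
  rss  <- sapply(cuts, function(s) {
    left  <- y[x <  s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  c(cutpoint = cuts[which.min(rss)], rss = min(rss))
}

# hypothetical data with an obvious break between x = 4 and x = 6
x <- c(1, 2, 3, 4, 6, 7, 8, 9)
y <- c(10, 11, 10, 11, 30, 31, 30, 31)
best_split(x, y)   # cutpoint 5, RSS 2
```

This is exactly the greedy choice: the split is best for this step alone, with no guarantee it leads to the best overall tree.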

Fitting Regression Trees

from {palmerpenguins} website

Fitting Regression Trees

library(tree)           # for tree()
library(palmerpenguins) # for the penguins data

tree(body_mass_g ~ ., data = penguins) -> peng_mass_tree

peng_mass_tree
node), split, n, deviance, yval
      * denotes terminal node

1) root 333 215300000 4207  
  2) species: Adelie,Chinstrap 214  40430000 3715  
    4) sex: female 107   8493000 3419 *
    5) sex: male 107  13240000 4010 *
  3) species: Gentoo 119  29670000 5092  
    6) sex: female 58   4519000 4680 *
    7) sex: male 61   5884000 5485 *

Fitting Regression Trees

peng_mass_tree$frame
      var   n       dev     yval splits.cutleft splits.cutright
1 species 333 215259666 4207.057            :ab              :c
2     sex 214  40428633 3714.720             :a              :b
4  <leaf> 107   8493224 3419.159                               
5  <leaf> 107  13241192 4010.280                               
3     sex 119  29674443 5092.437             :a              :b
6  <leaf>  58   4519321 4679.741                               
7  <leaf>  61   5884098 5484.836                               

Fitting Regression Trees

peng_mass_tree$where
  1   2   3   5   6   7   8  13  14  15  16  17  18  19  20  21  22  23  24  25 
  4   3   3   3   4   3   4   3   4   4   3   3   4   3   4   3   4   3   4   4 
 26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45 
  3   4   3   3   4   3   4   3   4   3   4   4   3   3   4   3   4   3   4   3 
 46  47  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66 
  4   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4 
 67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86 
  3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4 
 87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 
  4   3   4   3   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4 
107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 
  3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4 
127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 
  3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4 
147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 
  4   3   3   4   3   4   6   7   6   7   7   6   6   7   6   7   6   7   6   7 
167 168 169 170 171 172 173 174 175 176 177 178 180 181 182 183 184 185 186 187 
  6   7   6   7   6   7   7   6   6   7   6   7   7   6   7   7   6   6   7   6 
188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 
  7   6   7   6   7   6   7   6   7   7   6   6   7   6   7   6   7   6   7   6 
208 209 210 211 212 213 214 215 216 217 218 220 221 222 223 224 225 226 227 228 
  7   6   7   6   7   6   7   6   7   6   7   7   6   7   6   7   7   6   6   7 
229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 
  6   7   6   7   6   7   6   7   6   7   6   7   6   7   6   7   6   7   6   7 
249 250 251 252 253 254 255 256 258 259 260 261 262 263 264 265 266 267 268 270 
  7   6   6   7   6   7   6   7   7   6   7   6   7   6   7   6   7   6   7   7 
271 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 
  6   6   7   6   7   3   4   4   3   4   3   3   4   3   4   3   4   3   4   3 
292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 
  4   4   3   3   4   3   4   3   4   3   4   3   4   3   4   3   4   3   4   4 
312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 
  3   3   4   3   4   4   3   4   3   3   4   3   4   4   3   3   4   3   4   3 
332 333 334 335 336 337 338 339 340 341 342 343 344 
  4   3   4   4   3   4   3   3   4   3   4   4   3 

Fitting Regression Trees

summary(peng_mass_tree)

Regression tree:
tree(formula = body_mass_g ~ ., data = penguins)
Variables actually used in tree construction:
[1] "species" "sex"    
Number of terminal nodes:  4 
Residual mean deviance:  97680 = 32140000 / 329 
Distribution of residuals:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-760.30 -219.20   15.16    0.00  220.30  815.20 

Fitting Regression Trees

plot(peng_mass_tree)
text(peng_mass_tree, pretty = 0)

Fitting Regression Trees

na.omit(penguins) |>
    mutate(
        pred_mass = predict(peng_mass_tree)
    ) |>
        relocate(body_mass_g, pred_mass, everything())
# A tibble: 333 × 9
   body_mass_g pred_mass species island    bill_length_mm bill_depth_mm
         <int>     <dbl> <fct>   <fct>              <dbl>         <dbl>
 1        3750     4010. Adelie  Torgersen           39.1          18.7
 2        3800     3419. Adelie  Torgersen           39.5          17.4
 3        3250     3419. Adelie  Torgersen           40.3          18  
 4        3450     3419. Adelie  Torgersen           36.7          19.3
 5        3650     4010. Adelie  Torgersen           39.3          20.6
 6        3625     3419. Adelie  Torgersen           38.9          17.8
 7        4675     4010. Adelie  Torgersen           39.2          19.6
 8        3200     3419. Adelie  Torgersen           41.1          17.6
 9        3800     4010. Adelie  Torgersen           38.6          21.2
10        4400     4010. Adelie  Torgersen           34.6          21.1
# ℹ 323 more rows
# ℹ 3 more variables: flipper_length_mm <int>, sex <fct>, year <int>

Do it yourself


Run the linear model using the penguins data:
body_mass_g ~ species + sex
Compare the RSS of the linear model with the RSS of the decision tree.
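One possible solution sketch, assuming {tree} and {palmerpenguins} are installed as per the setup slide (the object names here are just suggestions):

```r
library(tree)
library(palmerpenguins)

peng <- na.omit(penguins)

# Linear model with the same two predictors the fitted tree ended up using
lm_fit <- lm(body_mass_g ~ species + sex, data = peng)
rss_lm <- sum(residuals(lm_fit)^2)

# Decision tree from the earlier slide
tree_fit <- tree(body_mass_g ~ ., data = peng)
rss_tree <- sum((peng$body_mass_g - predict(tree_fit))^2)

c(lm = rss_lm, tree = rss_tree)
```

Note the two models are not identical: the additive linear model shifts all species by the same sex effect, while the tree fits a separate mean in each species-by-sex leaf.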

Can't just keep splitting


Some terminal nodes below have very few observations (as few as 5 or 6)

node), split, n, deviance, yval
      * denotes terminal node

 1) root 263 53320000  535.9  
   2) CHits < 450 117  5931000  227.9  
     4) AtBat < 147 5  2940000  709.5 *
     5) AtBat > 147 112  1779000  206.4  
      10) CRBI < 114.5 74   302100  141.8 *
      11) CRBI > 114.5 38   567200  332.1 *
   3) CHits > 450 146 27390000  782.8  
     6) Walks < 61 104  9470000  649.6  
      12) AtBat < 395.5 53  2859000  510.0 *
      13) AtBat > 395.5 51  4504000  794.7  
        26) PutOuts < 771 45  2358000  746.4 *
        27) PutOuts > 771 6  1255000 1157.0 *
     7) Walks > 61 42 11500000 1113.0  
      14) RBI < 73.5 22  3148000  885.3  
        28) PutOuts < 239.5 7  1739000 1156.0 *
        29) PutOuts > 239.5 15   656300  758.9 *
      15) RBI > 73.5 20  5967000 1363.0  
        30) Years < 13.5 14  3767000 1521.0  
          60) CAtBat < 3814.5 8   529600 1141.0 *
          61) CAtBat > 3814.5 6   541500 2028.0 *
        31) Years > 13.5 6  1026000  992.5 *

Can't just keep splitting

So, when to stop?


The recursive splitting approach is likely to overfit,

leading to a complex tree.

A relatively smaller tree can avoid overfitting via lower variance, providing better interpretation at the cost of a little bias.

One could say that we split only if the reduction in RSS is larger than some threshold. But this is short-sighted: a seemingly worthless split early on may be followed by a very good split later.

So, what to do?


Grow the full tree.

Prune it back to a smaller subtree.

But which is the best subtree?

Well, intuitively, the one that leads to the lowest test error rate.

Finding the best subtree


We can use cross-validation to estimate the test error rate.

But the number of possible subtrees can be enormous.

So, we select a small sequence of subtrees using cost complexity pruning, also known as weakest link pruning.

Cost Complexity Pruning



For each value of a tuning parameter \(\alpha \geq 0\), find the subtree \(T \subset T_0\) that minimizes

\(\sum_{m=1}^{|T|}{\sum_{i:x_i \in R_m}{(y_i - \hat{y}_{R_m})^2}} + \alpha |T|\)

where \(|T|\) is the number of terminal nodes of the subtree \(T\).
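The trade-off the penalty term creates can be illustrated with a tiny base-R sketch; the leaf-level RSS numbers are made up:

```r
# Cost-complexity criterion: total RSS over the terminal nodes of a
# subtree T, plus a penalty alpha * |T| on the number of leaves.
cost_complexity <- function(leaf_rss, alpha) {
  sum(leaf_rss) + alpha * length(leaf_rss)
}

# hypothetical leaf-level RSS for a deeper and a shallower subtree
full_tree  <- c(10, 12, 8, 9)  # 4 leaves, lower total RSS (39)
small_tree <- c(25, 22)        # 2 leaves, higher total RSS (47)

cost_complexity(full_tree,  alpha = 0)   # 39: with no penalty, the full tree wins
cost_complexity(small_tree, alpha = 0)   # 47
cost_complexity(full_tree,  alpha = 10)  # 79: a large enough alpha ...
cost_complexity(small_tree, alpha = 10)  # 67: ... favours the smaller subtree
```

As \(\alpha\) grows from 0, branches are pruned off in a nested sequence; the `$k` component returned by `prune.tree()` and `cv.tree()` lists these cost-complexity values.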

The Combined Algorithm

  1. Use recursive binary splitting to grow a large tree.
  2. Apply cost complexity pruning to get a sequence of best subtrees, as a function of \(\alpha\).
  3. Use K-fold CV to choose \(\alpha\):
    1. Repeat steps 1 and 2 on all but the kth fold of the training data.
    2. Evaluate the MSE on the left-out kth fold, as a function of \(\alpha\). Average the results for each value of \(\alpha\), and choose the one that minimizes the average error.
  4. Return the subtree from step 2 that corresponds to the chosen value of \(\alpha\).

Step 1 - Grow the full tree

library(ISLR) # for the Hitters data

rsample::initial_split(data = Hitters, prop = .75) -> hit_split
rsample::training(hit_split) -> train_tree
rsample::testing(hit_split) -> test_tree

tree(Salary ~ ., data = train_tree) -> sal_hit_tree

Step 2 - Apply Cost Complexity Pruning

prune.tree(sal_hit_tree)
$size
 [1] 12 11 10  9  8  6  5  4  3  2  1

$dev
 [1] 10103044 10453308 10810982 11357152 11913936 13180950 13937113 14864006
 [9] 16751384 20426590 34883922

$k
 [1]       -Inf   350264.5   357674.1   546169.4   556784.6   633507.1
 [7]   756162.4   926893.5  1887377.9  3675205.6 14457332.2

$method
[1] "deviance"

attr(,"class")
[1] "prune"         "tree.sequence"

Step 3 - Apply CV to choose \(\alpha\)

cv.tree(sal_hit_tree,FUN = prune.tree) -> cv_pruned_estimates

cv_pruned_estimates
$size
 [1] 12 11 10  9  8  6  5  4  3  2  1

$dev
 [1] 23039605 22865159 22865159 23120219 23136553 22185870 21241115 21035958
 [9] 20490343 23071670 35050196

$k
 [1]       -Inf   350264.5   357674.1   546169.4   556784.6   633507.1
 [7]   756162.4   926893.5  1887377.9  3675205.6 14457332.2

$method
[1] "deviance"

attr(,"class")
[1] "prune"         "tree.sequence"

Step 3 - Apply CV to choose \(\alpha\)

tibble(
    leaves = cv_pruned_estimates$size,
    rss = cv_pruned_estimates$dev
) |>
    ggplot(aes(leaves, rss)) +
    geom_point()+
    geom_line()+
    theme_minimal()

Step 3 - Apply CV to choose \(\alpha\)

(Plot: CV deviance against the number of terminal nodes; the minimum is at 3 leaves)

Step 4 - get the subtree from the large tree

prune.tree(sal_hit_tree, best = 3) -> best_sub_tree_sal

plot(best_sub_tree_sal)
text(best_sub_tree_sal, pretty = 0)

Let's predict using the model

predict(best_sub_tree_sal, test_tree)
     -Andre Dawson   -Argenis Salazar     -Andres Thomas      -Alan Wiggins 
          913.5872           228.4322           228.4322           541.5909 
      -Billy Beane        -Buddy Bell     -Bobby Bonilla      -Brett Butler 
          228.4322           913.5872           228.4322           913.5872 
       -Bob Horner      -Bill Madlock       -Ben Oglivie       -Bip Roberts 
          913.5872           541.5909           541.5909           228.4322 
 -BillyJo Robidoux    -Craig Reynolds       -Cory Snyder   -Dann Bilardello 
          228.4322           541.5909           228.4322           228.4322 
     -Daryl Boston      -Dave Collins    -Darren Daulton       -Donnie Hill 
          228.4322           541.5909           228.4322           228.4322 
     -Dave Kingman     -Don Mattingly       -Dale Murphy -Darryl Strawberry 
          913.5872           913.5872           913.5872           913.5872 
      -Enos Cabell      -Eddie Murray      -Ernest Riles       -George Bell 
          541.5909           913.5872           228.4322           913.5872 
       -Greg Brock       -Gary Gaetti        -Garth Iorg       -Gary Pettis 
          228.4322           913.5872           541.5909           228.4322 
        -Hal McRae   -Harold Reynolds   -Herm Winningham     -Juan Beniquez 
          541.5909           228.4322           228.4322           541.5909 
   -John Cangelosi        -Jody Davis         -Jim Dwyer   -Jeffrey Leonard 
          228.4322           913.5872           541.5909           541.5909 
      -Joe Orsulak        -Jorge Orta       -Jamie Quirk          -Jim Rice 
          228.4322           541.5909           228.4322           913.5872 
       -Jim Traber        -Jose Uribe     -Jerry Willard   -Keith Hernandez 
          228.4322           228.4322           228.4322           913.5872 
    -Ken Landreaux  -Kevin McReynolds     -Larry Herndon      -Lonnie Smith 
          541.5909           913.5872           541.5909           913.5872 
     -Lou Whitaker   -Mike Fitzgerald     -Milt Thompson     -Mitch Webster 
          913.5872           228.4322           228.4322           228.4322 
    -Mookie Wilson        -Mike Young   -Oddibe McDowell       -Omar Moreno 
          541.5909           228.4322           228.4322           541.5909 
      -Phil Garner      -Pete O'Brien         -Pete Rose      -Pat Sheridan 
          541.5909           913.5872           541.5909           228.4322 
       -Pat Tabler          -Rob Deer        -Ron Kittle        -Ray Knight 
          913.5872           228.4322           228.4322           913.5872 
       -Rick Leach        -Ron Oester     -Ryne Sandberg         -Sid Bream 
          228.4322           913.5872           913.5872           228.4322 
   -Tony Bernazard      -Tom Brookens       -Toby Harrah        -Tim Hulett 
          913.5872           541.5909           541.5909           228.4322 
    -Terry Kennedy      -Tito Landrum       -Tim Laudner        -Terry Puhl 
          228.4322           228.4322           228.4322           541.5909 
    -Wally Backman 
          541.5909